'data.frame': 4898 obs. of 13 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
$ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
$ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
$ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
$ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
$ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
$ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
$ density : num 1.001 0.994 0.995 0.996 0.996 ...
$ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
$ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
$ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
$ quality : int 6 6 6 6 6 6 6 6 6 6 ...
X fixed.acidity volatile.acidity citric.acid
Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
residual.sugar chlorides free.sulfur.dioxide
Min. : 0.600 Min. :0.00900 Min. : 2.00
1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
Median : 5.200 Median :0.04300 Median : 34.00
Mean : 6.391 Mean :0.04577 Mean : 35.31
3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
Max. :65.800 Max. :0.34600 Max. :289.00
total.sulfur.dioxide density pH sulphates
Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
alcohol quality
Min. : 8.00 Min. :3.000
1st Qu.: 9.50 1st Qu.:5.000
Median :10.40 Median :6.000
Mean :10.51 Mean :5.878
3rd Qu.:11.40 3rd Qu.:6.000
Max. :14.20 Max. :9.000
This dataset contains 4898 observations of 13 columns: the row index X plus 12 wine variables (11 physicochemical attributes and quality).
3 4 5 6 7 8 9
20 163 1457 2198 880 175 5
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 5.000 6.000 5.878 6.000 9.000
In this dataset, most samples are rated 5 or 6, which are normal wines. The mean wine rating is 5.878. Excellent and poor wines are much rarer, with only 5 samples rated 9 and 20 rated 3.
I selected four variables that I am interested in to plot: volatile acidity, alcohol, chlorides and density.
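The counts printed above are enough to double-check the figures quoted in this paragraph. The snippet below is a minimal sketch that re-enters the table() output by hand and recomputes the mean rating and the share of wines rated 5 or 6. The name quality_counts is my own and does not come from the original analysis.

```r
# Counts per quality score, copied by hand from the table() output above.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)
scores <- as.numeric(names(quality_counts))

# Total sample size, count-weighted mean rating, and share of "normal" wines.
n_total      <- sum(quality_counts)
mean_quality <- sum(scores * quality_counts) / n_total
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / n_total

round(mean_quality, 3)
round(share_5_or_6, 3)
```

On the full data frame the same numbers should come from table(winedf$quality) and mean(winedf$quality). Roughly three quarters of the samples fall into the two middle ratings.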
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
1) The density values are concentrated between 0.9871 and 1, and the distribution appears roughly normal.
2) The alcohol values range from 8% to 14.2%, but most wine samples have an alcohol content between 9% and 13%.
3) The chlorides values are concentrated between 0.01 and 0.1, but this variable seems to have many outliers on the right side, which the boxplot confirms. To make the distribution look more normal, a log10() transform was applied.
4) For volatile acidity, most values are between 0.2 and 0.3. However, the distribution has a relatively long tail on the right side: 170 samples contain more than 0.5 g / dm^3 of volatile acidity. I wonder whether wines with high volatile acidity tend to be rated lower, since too high a level of acetic acid leads to an unpleasant, vinegar taste.
5) Like volatile acidity and chlorides, the distribution of residual sugar is strongly right-skewed. After a log transformation, the distribution appears bimodal.
I also created a new variable ‘perfree’ to represent the share of free SO2 in total SO2. I am curious whether this new variable is correlated with wine quality.
There are 4898 white wine samples in the dataset with 12 features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality.
The main features of interest in this dataset are wine quality and alcohol content. I want to investigate which features best predict wine quality. According to my online research, alcohol content is discussed a lot by wine experts as well as ordinary consumers, so I am interested in how the alcohol level affects wine quality.
In addition to alcohol, I think other features such as density, volatile acidity and chlorides can also affect wine quality. For example, the data documentation mentions that acetic acid, which volatile acidity measures, can lead to an unpleasant, vinegar taste at too high a level.
A new variable, perfree, was created to represent the percent of free sulfur dioxide in total sulfur dioxide. It is stored as a fraction (for example, 0.265 for the first sample).
The chlorides distribution appears right-skewed, so I log-transformed this variable. After the transformation, the distribution of chlorides content looks more normal.
I want to see how correlated the different variables are.
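The original code for this step is not shown, so the following is a minimal sketch under one assumption: that perfree is simply free sulfur dioxide divided by total sulfur dioxide. That assumption reproduces the perfree values printed in the later str() output. The data frame name winedf_head is hypothetical and holds only the first five rows shown in the str() output.

```r
# First five rows of the two SO2 columns, copied from the str() output.
winedf_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186)
)

# Assumed definition: free SO2 as a fraction of total SO2.
winedf_head$perfree <- winedf_head$free.sulfur.dioxide /
  winedf_head$total.sulfur.dioxide

round(winedf_head$perfree, 3)
```

The rounded values are 0.265, 0.106, 0.309, 0.253 and 0.253, which agree with the first five perfree entries in the str() output. Applied to the full data, the same expression would be used with winedf in place of winedf_head.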
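To illustrate why the log transform helps, the sketch below uses only the five-number summary of chlorides from the summary() output, not the raw data. It compares how far the maximum and the minimum sit from the median before and after log10(). The helper tail_ratio is a hypothetical name and a deliberately crude measure, since the minimum and maximum are themselves sensitive to outliers.

```r
# Five-number summary of chlorides, copied from the summary() output.
chlorides_q <- c(min = 0.009, q1 = 0.036, median = 0.043,
                 q3 = 0.050, max = 0.346)

# Length of the upper tail relative to the lower tail, measured from the
# median. A value near 1 means the extremes sit symmetrically around it.
tail_ratio <- function(q) {
  unname((q["max"] - q["median"]) / (q["median"] - q["min"]))
}

ratio_raw <- tail_ratio(chlorides_q)
ratio_log <- tail_ratio(log10(chlorides_q))

round(c(raw = ratio_raw, log10 = ratio_log), 2)
```

On the raw scale the upper tail is about 8.9 times as long as the lower tail. After log10() the ratio drops to about 1.3, which is consistent with the histogram looking more symmetric.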
'data.frame': 4898 obs. of 13 variables:
$ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
$ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
$ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
$ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
$ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
$ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
$ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
$ density : num 1.001 0.994 0.995 0.996 0.996 ...
$ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
$ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
$ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
$ quality : int 6 6 6 6 6 6 6 6 6 6 ...
$ perfree : num 0.265 0.106 0.309 0.253 0.253 ...
According to the plot matrix, fixed acidity, citric acid, residual sugar, free sulfur dioxide, pH and sulphates do not seem to have strong correlations with wine quality. Alcohol, density, chlorides and the percent of free SO2 in total SO2, however, are moderately correlated with wine quality.
My goal is to investigate which attributes have the biggest impact on wine quality and how they affect it. Before analysing the attributes against quality, I wanted to look at how the attributes are correlated with each other. I selected two pairs of variables that are highly correlated with each other: density and alcohol, and density and residual sugar.
This plot shows a clear relation between density and alcohol: as wine density increases, alcohol content decreases.
Density and residual sugar are strongly and positively related, which makes sense because wine density depends mainly on sugar and alcohol content.
Next I am going to plot wine quality against alcohol content and density, since these two attributes have the highest correlation coefficients with wine quality.
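The numbers behind a plot matrix like this come from a pairwise correlation matrix. The sketch below shows the mechanics with cor() on the ten rows visible in the str() output. The data frame wine_head is a hypothetical name, and ten rows are far too few to reproduce the coefficients of the full analysis. Quality is left out because all ten visible rows are rated 6, which would give an undefined correlation.

```r
# First ten rows of four columns, copied from the str() output above.
wine_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  chlorides      = c(0.045, 0.049, 0.05, 0.058, 0.058,
                     0.05, 0.045, 0.045, 0.049, 0.044),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pairwise Pearson correlations, rounded for easier reading.
cor_mat <- round(cor(wine_head), 2)
cor_mat
```

On the full data, cor(winedf) would give the matrix for all numeric columns. The row or column for quality then shows which attributes are most correlated with the rating.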
winedf$quality: 3
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.00 9.55 10.45 10.34 11.00 12.60
--------------------------------------------------------
winedf$quality: 4
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.40 9.40 10.10 10.15 10.75 13.50
--------------------------------------------------------
winedf$quality: 5
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.000 9.200 9.500 9.809 10.300 13.600
--------------------------------------------------------
winedf$quality: 6
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.50 9.60 10.50 10.58 11.40 14.00
--------------------------------------------------------
winedf$quality: 7
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.60 10.60 11.40 11.37 12.30 14.20
--------------------------------------------------------
winedf$quality: 8
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.50 11.00 12.00 11.64 12.60 14.00
--------------------------------------------------------
winedf$quality: 9
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 12.40 12.50 12.18 12.70 12.90
Alcohol content and quality are moderately correlated according to the correlation matrix. As the point plot and boxplot show, wines rated 5 have the lowest median alcohol content (9.5%). From a rating of 5 upward, the median alcohol content rises with each quality level.
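The per-group summaries above appear to come from by(winedf$alcohol, winedf$quality, summary), judging by the header lines. As a quick check of the pattern described here, the sketch below re-enters the printed medians by hand. The vector name median_alcohol is my own.

```r
# Median alcohol content per quality score, copied from the output above.
median_alcohol <- c(`3` = 10.45, `4` = 10.10, `5` = 9.50, `6` = 10.50,
                    `7` = 11.40, `8` = 12.00, `9` = 12.50)

# Which quality group has the lowest median alcohol content?
lowest_group <- names(which.min(median_alcohol))

# Do the medians rise strictly from quality 5 through quality 9?
rising_from_5 <- all(diff(median_alcohol[c("5", "6", "7", "8", "9")]) > 0)

lowest_group
rising_from_5
```

The lowest median belongs to quality 5, and the medians increase at every step from 5 to 9. Ratings 3 and 4 do not follow the trend, and those groups are small (20 and 163 samples).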
winedf$quality: 3
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9911 0.9925 0.9944 0.9949 0.9969 1.0000
--------------------------------------------------------
winedf$quality: 4
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9892 0.9926 0.9941 0.9943 0.9958 1.0000
--------------------------------------------------------
winedf$quality: 5
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9872 0.9933 0.9953 0.9953 0.9972 1.0020
--------------------------------------------------------
winedf$quality: 6
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
--------------------------------------------------------
winedf$quality: 7
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
--------------------------------------------------------
winedf$quality: 8
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
--------------------------------------------------------
winedf$quality: 9
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9896 0.9898 0.9903 0.9915 0.9906 0.9970
Density seems to have a negative relationship with wine quality. As the plot shows, better wines usually have lower density.
The percent of free SO2 in total SO2 shows a trend across wine qualities: good quality wines tend to have a higher percent of free SO2. As the plot shows, wines rated above the median quality of 6 tend to have a free SO2 percent above the overall median. However, compared with alcohol and density, the difference in free SO2 percent between quality groups is less evident.
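One rough way to put a number on this trend is a rank correlation between the quality levels and the median density of each group. The sketch below uses the medians printed above. Note that this is a correlation of seven group medians, not of the 4898 individual samples, so it overstates the sample-level relationship. The variable names are hypothetical.

```r
# Quality levels and the median density of each group, copied from above.
quality_levels <- 3:9
median_density <- c(0.9944, 0.9941, 0.9953, 0.9937, 0.9918, 0.9916, 0.9903)

# Spearman rank correlation: negative if density falls as quality rises.
rho <- cor(quality_levels, median_density, method = "spearman")
round(rho, 2)
```

The coefficient is strongly negative, which matches the impression from the boxplot that higher-rated groups sit at lower densities.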
[1] -0.2728567
[1] -0.2099344
In general, wine quality increases as chlorides decrease. However, as the boxplot shows, there are many chlorides outliers in quality groups 5 and 6. In addition, after the chlorides variable is log-transformed, the absolute correlation coefficient between chlorides and wine quality increases from 0.21 to 0.27.
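The comparison of the two coefficients can be wrapped in a small helper. The function cor_raw_vs_log below is a hypothetical name, and the toy vectors are made up purely to show the mechanics. They are not wine data and are constructed so that the relationship is exactly linear on the log scale.

```r
# Pearson correlation of y with x and with log10(x), side by side.
# x must be strictly positive for the log transform to be defined.
cor_raw_vs_log <- function(x, y) {
  stopifnot(all(x > 0))
  c(raw = cor(x, y), log10 = cor(log10(x), y))
}

# Toy data (NOT from the wine dataset): y drops by one each time x doubles.
toy_x <- c(0.01, 0.02, 0.04, 0.08, 0.32)
toy_y <- c(8, 7, 6, 5, 3)

toy_result <- cor_raw_vs_log(toy_x, toy_y)
round(toy_result, 3)
```

On the full data, the call would be cor_raw_vs_log(winedf$chlorides, winedf$quality). The two values printed above this paragraph correspond to the log-transformed and raw coefficients.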
Among all the investigated variables, wine quality is most strongly related to alcohol content, with a correlation coefficient of 0.44. For wines rated 5 and above, quality tends to improve as alcohol content increases.
In addition to alcohol content, wine quality is also related to wine density. Better wines usually have lower density.
There is a strong relation between wine density and residual sugar, which is expected since the data documentation mentions that the density of wine depends on the percent of alcohol and sugar content. Wine density increases as residual sugar content increases. Wine density and alcohol content are also strongly and negatively correlated: density clearly decreases as alcohol content increases. This strong relation concerns me because I am planning to incorporate both density and alcohol content into a predictive model, which could introduce a multicollinearity issue.
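A common way to quantify this concern is the variance inflation factor (VIF): each predictor is regressed on the others, and VIF = 1 / (1 - R^2). The sketch below is one possible implementation in base R and is not part of the original analysis. The function vif_manual and the data frame predictors_head are hypothetical names. The demonstration uses the ten rows visible in the str() output. Density is omitted there only because the str() output shows just five of its values.

```r
# VIF for every column of a data frame of numeric predictors.
# Each predictor is regressed on all the others with lm().
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Ten rows copied from the str() output; too few for real conclusions.
predictors_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

vifs <- vif_manual(predictors_head)
round(vifs, 2)
```

On the full data, the call would be something like vif_manual(winedf[c("alcohol", "density", "residual.sugar")]). A VIF of 1 means a predictor is uncorrelated with the others. Values above roughly 5 to 10 are a commonly used rule of thumb for problematic multicollinearity.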
Alcohol content is moderately and positively correlated with wine quality. Density also correlates with wine quality, but less strongly than alcohol content.
Alcohol content and wine density are highly correlated with each other, which may cause a multicollinearity issue when building a predictive model with those two attributes.
More of the higher quality wine samples (rating > 6) are located in the upper left section than in the other three sections, which corresponds to my previous analysis that wine quality is positively related to alcohol but negatively related to density.
Most of the higher quality sample points cluster in the low-chlorides, low-density section of this point plot.
Among all those variables, alcohol content is most strongly related to wine quality, but chlorides can also explain some of the variation. As the above plot shows, holding alcohol constant, most of the high quality wine samples are below the mean line of chlorides values.
It is hard to see a pattern in how wine quality is distributed along the perfree variable (percent of free SO2 in total SO2). With alcohol held constant, the high quality wines seem evenly distributed along the Y axis.
winedf$quality: 3
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1700 0.2375 0.2600 0.3332 0.4125 0.6400
--------------------------------------------------------
winedf$quality: 4
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1100 0.2700 0.3200 0.3812 0.4600 1.1000
--------------------------------------------------------
winedf$quality: 5
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.100 0.240 0.280 0.302 0.340 0.905
--------------------------------------------------------
winedf$quality: 6
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
--------------------------------------------------------
winedf$quality: 7
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
--------------------------------------------------------
winedf$quality: 8
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1200 0.2000 0.2600 0.2774 0.3300 0.6600
--------------------------------------------------------
winedf$quality: 9
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.240 0.260 0.270 0.298 0.360 0.360
No strong correlation was observed between wine quality and volatile acidity. High quality wines seem to be distributed evenly along the volatile acidity axis, which surprises me since a high level of acetic acid is said to lead to an unpleasant taste.
Alcohol content is most strongly correlated with wine quality, but other variables can also contribute to the variation in quality. For instance, holding alcohol constant, most of the high quality wine samples have chlorides below the mean.
I was expecting volatile acidity to have an impact on wine quality, since the data documentation mentions that acetic acid at too high a level can lead to an unpleasant, vinegar taste. However, there is no obvious pattern in how wine quality is distributed along volatile acidity.
3 4 5 6 7 8 9
20 163 1457 2198 880 175 5
Most wine samples are rated 5 or 6, which are normal wines. The mean wine rating is 5.878. Excellent and poor wines are much rarer, with only 5 samples rated 9 and 20 rated 3.
Alcohol content and wine quality are moderately correlated according to the correlation matrix. As the point plot and boxplot show, wines rated 5 have the lowest median alcohol content. From a rating of 5 upward, wine quality tends to improve as alcohol content increases.
More of the higher quality wine samples (rating > 6) are located in the lower left section than in the other sections, which means quality is negatively related to wine density and chlorides content.
This white wine dataset contains 4898 samples, with 12 attributes measured. Most samples are rated 5 or 6, which are normal wines. Excellent and poor wines are much rarer, with only 5 samples rated 9 and 20 rated 3.
Alcohol content is discussed a lot by wine experts as well as ordinary consumers, so my initial interest was to explore the relation between wine quality and alcohol content. By calculating the correlation coefficient and plotting quality against alcohol, I found that wine quality is moderately correlated with alcohol content and that better wines usually have higher alcohol content.
Alcohol content is clearly not the only factor determining wine quality, so I explored other variables that could affect it: density, chlorides, volatile acidity and the percent of free SO2 in total SO2. I could see a trend between density, chlorides and quality, but I was surprised that there is no obvious pattern in the distribution of quality against volatile acidity. I had expected wines with a high level of volatile acidity to be rated lower, since too high a level could lead to an unpleasant taste.
According to the correlation matrix, both alcohol and density are correlated with wine quality, but these two variables are also highly correlated with each other. If a predictive model is built in the future, incorporating both variables could introduce a multicollinearity issue, so further analysis is needed before using them together in a model.